Name: Sung-Jen, Yen

Course: BUDT704

Section: 0502

Date: 11/15/2023

Prince George's County Food Inspection Analysis¶

As part of Prince George's County, the University of Maryland nurtures thousands of students every day, providing them meals at school. Therefore, it is the responsibility of both the university and Prince George's County to monitor and maintain the sanitation of the food we eat.

The dataset from Prince George’s County Health Department gives us precious information on how they inspect and monitor complaint investigations. The department provides several services to residents through the Food Protection/Policy Program, including:

  1. routine inspections for food service facilities
  2. temporary or seasonal food operations

The dataset accumulates inspection records from March 6, 2019 through November 3, 2023, with 44,600 rows and 29 columns, where each row is a food inspection. We will mainly focus on the records categorized as restaurants, since these food providers interact with students the most on a daily basis.

Part 1: Importing and Preparing Data¶

Let's first take a look at what our dataset looks like, so we can determine whether it needs any preprocessing.

In [1]:
# Import data processing libraries
import numpy as np
import pandas as pd
In [2]:
# Import dataset from csv file
df = pd.read_csv('HW6_Food Inspections Nov 2023.csv')

# Count the rows and columns
print(df.shape)

# Take a look at the dataset
df.head()
(44597, 29)
Out[2]:
Establishment_id Name Category Inspection_date Inspection_results City State Zip Address_line_1 Address_line_2 ... Hot_and_cold_running_water_provided Proper_sewage_disposal No_bare_hand_contact Adequate_hand_washing_facilities Rodent_and_insects Food_contact_surfaces_and_equipment Inspection_type Owner Type Location
0 900 LONGHORN STEAKHOUSE OF BOWIE 5174 Restaurant 05/13/2020 12:00:00 AM ------ BOWIE MD 20716.0 4100 TOWN CENTER BLVD ------ ... In Compliance In Compliance In Compliance In Compliance In Compliance In Compliance NaN Colleen Lyons Food Establishments POINT (-76.733979 38.945787)
1 16073 Anarkali Bazar Grocery Store 03/24/2020 12:00:00 AM Non-Compliant - Violations Observed GREENBELT MD 20770.0 6106 GREENBELT RD ------ ... In Compliance In Compliance In Compliance In Compliance In Compliance In Compliance Pre-Opening Inspection YASIR RANA Food Establishments POINT (-76.908332 38.997396)
2 1207 PUPUSERIA MORAZAN Fast Food - Local 05/11/2020 12:00:00 AM ------ BRENTWOOD MD 20722.0 4309 BLADENSBURG RD ------ ... In Compliance In Compliance In Compliance In Compliance In Compliance In Compliance Food Complaint NAPOLEON CASTRO Food Establishments POINT (-76.944736 38.937661)
3 3412 THE TENDER RIB & CATERING CO Fast Food - Local 05/09/2020 12:00:00 AM ------ SUITLAND MD 20746.0 4809 ALLENTOWN RD ------ ... In Compliance In Compliance In Compliance In Compliance In Compliance In Compliance NaN CYNTHIA WALLACE Food Establishments POINT (-76.889999 38.818813)
4 16073 Anarkali Bazar Grocery Store 03/31/2020 12:00:00 AM Compliance Schedule - Completed GREENBELT MD 20770.0 6106 GREENBELT RD ------ ... In Compliance In Compliance In Compliance In Compliance In Compliance In Compliance Re-inspection YASIR RANA Food Establishments POINT (-76.908332 38.997396)

5 rows × 29 columns

There are a total of 44,597 rows and 29 columns, which matches the dataset description from Prince George's County's website. However, it seems that some of the rows have missing values. Let's check the summary for the whole dataset.

In [3]:
# Take a look at the summary description of missing values in the dataset
df.isnull().sum()
Out[3]:
Establishment_id                          0
Name                                      0
Category                               1007
Inspection_date                           0
Inspection_results                        0
City                                     24
State                                     0
Zip                                       3
Address_line_1                            3
Address_line_2                            0
Food_from_approved_source                 0
Food_protected_from_contamination         0
Ill_workers_restricted                    0
Proper_hand_washing                       0
Cooling_time_and_temperature              0
Cold_holding_temperature                  0
Hot_holding_temperature                   0
Cooking_time_and_temperature              0
Reheating_time_and_temperature            0
Hot_and_cold_running_water_provided       0
Proper_sewage_disposal                    0
No_bare_hand_contact                      0
Adequate_hand_washing_facilities          0
Rodent_and_insects                        0
Food_contact_surfaces_and_equipment       0
Inspection_type                        2816
Owner                                    53
Type                                      0
Location                                190
dtype: int64

From the table, we see that the Category, City, Zip, Address_line_1, Inspection_type, Owner, and Location columns have missing values.

However, we should also remember that, from the earlier view of the dataset, the Inspection_results and Address_line_2 columns use '------' to denote missing values. Therefore, we need to convert the '------' entries in those columns to NaN, so that we don't run into errors when we process the data later.

In [4]:
# Replace '------' with NaN values
df.replace('------', np.nan, inplace=True)

# Check if the replacement succeeded
df.head()
Out[4]:
Establishment_id Name Category Inspection_date Inspection_results City State Zip Address_line_1 Address_line_2 ... Hot_and_cold_running_water_provided Proper_sewage_disposal No_bare_hand_contact Adequate_hand_washing_facilities Rodent_and_insects Food_contact_surfaces_and_equipment Inspection_type Owner Type Location
0 900 LONGHORN STEAKHOUSE OF BOWIE 5174 Restaurant 05/13/2020 12:00:00 AM NaN BOWIE MD 20716.0 4100 TOWN CENTER BLVD NaN ... In Compliance In Compliance In Compliance In Compliance In Compliance In Compliance NaN Colleen Lyons Food Establishments POINT (-76.733979 38.945787)
1 16073 Anarkali Bazar Grocery Store 03/24/2020 12:00:00 AM Non-Compliant - Violations Observed GREENBELT MD 20770.0 6106 GREENBELT RD NaN ... In Compliance In Compliance In Compliance In Compliance In Compliance In Compliance Pre-Opening Inspection YASIR RANA Food Establishments POINT (-76.908332 38.997396)
2 1207 PUPUSERIA MORAZAN Fast Food - Local 05/11/2020 12:00:00 AM NaN BRENTWOOD MD 20722.0 4309 BLADENSBURG RD NaN ... In Compliance In Compliance In Compliance In Compliance In Compliance In Compliance Food Complaint NAPOLEON CASTRO Food Establishments POINT (-76.944736 38.937661)
3 3412 THE TENDER RIB & CATERING CO Fast Food - Local 05/09/2020 12:00:00 AM NaN SUITLAND MD 20746.0 4809 ALLENTOWN RD NaN ... In Compliance In Compliance In Compliance In Compliance In Compliance In Compliance NaN CYNTHIA WALLACE Food Establishments POINT (-76.889999 38.818813)
4 16073 Anarkali Bazar Grocery Store 03/31/2020 12:00:00 AM Compliance Schedule - Completed GREENBELT MD 20770.0 6106 GREENBELT RD NaN ... In Compliance In Compliance In Compliance In Compliance In Compliance In Compliance Re-inspection YASIR RANA Food Establishments POINT (-76.908332 38.997396)

5 rows × 29 columns
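If the sentinel is known up front, the same cleanup can happen at load time. A minimal sketch (using a toy inline CSV rather than the real file) of pandas' `na_values` parameter:

```python
import io
import pandas as pd

# Toy CSV standing in for the real file; '------' is the missing-value sentinel.
csv = io.StringIO("Name,Inspection_results\nLONGHORN,------\nANARKALI,Non-Compliant")

# na_values converts the sentinel to NaN during parsing, so no replace() pass is needed.
df_demo = pd.read_csv(csv, na_values=['------'])
```

With this approach, the original `pd.read_csv` call could simply pass `na_values=['------']` and skip the separate `replace` step.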

Part 2: Transforming Data¶

Before we perform the analysis, we should see whether any categorical columns need transformation so that we can extract more information from our data.

Since we mainly focus on restaurant records, let's first define which categories belong to 'restaurant'.

In [5]:
# Show each category from the food inspection dataset
df['Category'].unique()
Out[5]:
array(['Restaurant', 'Grocery Store', 'Fast Food - Local', 'Carry-out',
       'Convenience Store', 'Seafood', 'Fast Food - Chain', nan,
       'Institution', 'Grocery Store- Medium', 'Meat/Poultry Market',
       'Gas Station Store', 'Membership Warehouse', 'Ice Cream', 'Casino',
       'Public School', 'Coffee Shop', '@Full Service', 'Specialty Store',
       'Hotel', 'Dollar Store', 'Fast Food', 'Full Service',
       'Private School', 'Stadium/Amusement Park', 'Bakery',
       'Multiple Facilities', 'College/University', 'Catering Only',
       'Snack Bar/Concession Stand', 'Limited Service',
       'Pre-Packaged Only', 'Deli', 'Bar/Tavern/Lounge',
       'Health Care Facility', 'Buffet', 'After School Supper Program',
       'Private Club', 'Cafeteria', 'Night Club', 'Grocery Store - Large',
       'Church/Temple/Mosque', 'Fire/Community Hall',
       'Banquet Hall/Ballroom', 'Senior Nutrition Program',
       'Diet/Nutrition Site', 'Bakery/Catering', '@Fast Food-Do Not Use',
       'Full Service/Catering', 'Pizza', 'Delivery Only', 'Group Home',
       '@Fast Food', 'Farm Market', 'Micro Market',
       '@Full Service-Do Not Use', 'Diner', 'Excluded', 'Donut',
       "Farmer's Market", 'B & B', 'School'], dtype=object)

From the list, we will select only the facilities where students most commonly have their meals. Therefore, we will not include grocery stores, markets, or other options that are normally not around student activity areas. We will also exclude bakeries, coffee shops, and snack bars/concession stands because they are not restaurants in the common sense.

In [6]:
# Define the categories that count as 'restaurant'
mask = ['Restaurant', 'Fast Food - Local', 'Carry-out', 'Seafood', 'Fast Food - Chain',
       'Fast Food', 'College/University', 'Catering Only', 'Deli', 'Bar/Tavern/Lounge', 
       'Buffet', 'Cafeteria', 'Pizza', 'Delivery Only', '@Fast Food', 'Diner', 'Donut',
       'B & B', 'School']

# Create a dataset that only contains restaurant inspection records
df_rest = df[df['Category'].isin(mask)]
df_rest.head()
Out[6]:
Establishment_id Name Category Inspection_date Inspection_results City State Zip Address_line_1 Address_line_2 ... Hot_and_cold_running_water_provided Proper_sewage_disposal No_bare_hand_contact Adequate_hand_washing_facilities Rodent_and_insects Food_contact_surfaces_and_equipment Inspection_type Owner Type Location
0 900 LONGHORN STEAKHOUSE OF BOWIE 5174 Restaurant 05/13/2020 12:00:00 AM NaN BOWIE MD 20716.0 4100 TOWN CENTER BLVD NaN ... In Compliance In Compliance In Compliance In Compliance In Compliance In Compliance NaN Colleen Lyons Food Establishments POINT (-76.733979 38.945787)
2 1207 PUPUSERIA MORAZAN Fast Food - Local 05/11/2020 12:00:00 AM NaN BRENTWOOD MD 20722.0 4309 BLADENSBURG RD NaN ... In Compliance In Compliance In Compliance In Compliance In Compliance In Compliance Food Complaint NAPOLEON CASTRO Food Establishments POINT (-76.944736 38.937661)
3 3412 THE TENDER RIB & CATERING CO Fast Food - Local 05/09/2020 12:00:00 AM NaN SUITLAND MD 20746.0 4809 ALLENTOWN RD NaN ... In Compliance In Compliance In Compliance In Compliance In Compliance In Compliance NaN CYNTHIA WALLACE Food Establishments POINT (-76.889999 38.818813)
5 900 LONGHORN STEAKHOUSE OF BOWIE 5174 Restaurant 05/13/2020 12:00:00 AM NaN BOWIE MD 20716.0 4100 TOWN CENTER BLVD NaN ... In Compliance In Compliance In Compliance In Compliance In Compliance In Compliance Food Complaint Colleen Lyons Food Establishments POINT (-76.733979 38.945787)
6 1207 PUPUSERIA MORAZAN Fast Food - Local 05/11/2020 12:00:00 AM NaN BRENTWOOD MD 20722.0 4309 BLADENSBURG RD NaN ... In Compliance In Compliance In Compliance In Compliance In Compliance In Compliance NaN NAPOLEON CASTRO Food Establishments POINT (-76.944736 38.937661)

5 rows × 29 columns

Now we have the inspection records that contain only restaurants. Let's pivot the columns a little so that we can compute summary statistics later if needed.

First, we want to create columns for each restaurant category.

In [7]:
# Create new columns for each restaurant category
category = pd.get_dummies(df_rest['Category'])

# Concat these new columns to the restaurant dataset
df_rest = pd.concat([df_rest, category], axis=1)

# Check if the new columns were created successfully
df_rest.head()
Out[7]:
Establishment_id Name Category Inspection_date Inspection_results City State Zip Address_line_1 Address_line_2 ... Delivery Only Diner Donut Fast Food Fast Food - Chain Fast Food - Local Pizza Restaurant School Seafood
0 900 LONGHORN STEAKHOUSE OF BOWIE 5174 Restaurant 05/13/2020 12:00:00 AM NaN BOWIE MD 20716.0 4100 TOWN CENTER BLVD NaN ... 0 0 0 0 0 0 0 1 0 0
2 1207 PUPUSERIA MORAZAN Fast Food - Local 05/11/2020 12:00:00 AM NaN BRENTWOOD MD 20722.0 4309 BLADENSBURG RD NaN ... 0 0 0 0 0 1 0 0 0 0
3 3412 THE TENDER RIB & CATERING CO Fast Food - Local 05/09/2020 12:00:00 AM NaN SUITLAND MD 20746.0 4809 ALLENTOWN RD NaN ... 0 0 0 0 0 1 0 0 0 0
5 900 LONGHORN STEAKHOUSE OF BOWIE 5174 Restaurant 05/13/2020 12:00:00 AM NaN BOWIE MD 20716.0 4100 TOWN CENTER BLVD NaN ... 0 0 0 0 0 0 0 1 0 0
6 1207 PUPUSERIA MORAZAN Fast Food - Local 05/11/2020 12:00:00 AM NaN BRENTWOOD MD 20722.0 4309 BLADENSBURG RD NaN ... 0 0 0 0 0 1 0 0 0 0

5 rows × 48 columns

Now that we have separate columns for each restaurant category, let's take a look at how each category performs in inspection results.

In [8]:
# Create summary statistics on inspection results (as proportions)
proportion = df_rest['Inspection_results'].value_counts(normalize=True)

# Print the percentage of inspections whose result is currently not in compliance
non_compliance = proportion[~proportion.index.str.match('^Compliance|^Compliant')]
non_compliance
Out[8]:
Critical Violations observed           0.370257
Non-Compliant - Violations Observed    0.230753
Facility Closed                        0.019793
Facility Reopened                      0.010238
No Critical Violations Observed        0.000341
Name: Inspection_results, dtype: float64

From the table we can see that nearly 63% of inspections found restaurants not in compliance. Of all the non-compliance results, critical violations account for over half. This might influence consumers' decisions about whether to dine at PG County's restaurants.
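The filtering step above relies on `value_counts(normalize=True)` producing proportions and a regex keeping only labels that do not start with 'Compliance' or 'Compliant'. A small self-contained sketch of that idea on toy data:

```python
import pandas as pd

# Toy results column: two compliant inspections, two non-compliant ones.
results = pd.Series(['Compliant', 'Compliant',
                     'Critical Violations observed',
                     'Non-Compliant - Violations Observed'])

# Proportion of each result label.
prop = results.value_counts(normalize=True)

# Keep labels that do NOT begin with 'Compliance' or 'Compliant'.
# 'Non-Compliant - ...' survives because ^ anchors the match at the start.
non_comp = prop[~prop.index.str.match('^Compliance|^Compliant')]
```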

To be able to compute summary statistics by year and month, let's create a new date column from the original inspection date column.
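As a sketch of an alternative approach (not what the notebook does), the timestamp can be parsed in one step and then snapped to the first of its month, avoiding regex slicing entirely:

```python
import pandas as pd

# Timestamps in the dataset's 'MM/DD/YYYY hh:mm:ss AM/PM' shape.
s = pd.Series(['05/13/2020 12:00:00 AM', '11/02/2023 12:00:00 AM'])

# Parse once, then truncate each date to the start of its month.
date = pd.to_datetime(s, format='%m/%d/%Y %I:%M:%S %p').dt.to_period('M').dt.to_timestamp()
```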

In [9]:
# Slice year and month information into a new column
df_rest['Date'] = df_rest['Inspection_date'].str.extract(r'(\d{4})')  + '/' +  df_rest['Inspection_date'].str.extract(r'(^\d{2})')

# Convert the Date column from string to datetime format for later analysis
df_rest['Date'] = pd.to_datetime(df_rest['Date'])

# Create a new dataset that contains the inspection date time series data
df_trans = df_rest.sort_values(by='Date')

# Check the latest data entries to see if they match the data source document description
df_trans.tail()
Out[9]:
Establishment_id Name Category Inspection_date Inspection_results City State Zip Address_line_1 Address_line_2 ... Diner Donut Fast Food Fast Food - Chain Fast Food - Local Pizza Restaurant School Seafood Date
44578 794 KENNYS SUB SHOP Carry-out 11/02/2023 12:00:00 AM NaN GREENBELT MD 20770.0 8823 GREENBELT RD NaN ... 0 0 0 0 0 0 0 0 0 2023-11-01
44591 8408 ATOMIC WINGS Fast Food - Local 11/02/2023 12:00:00 AM Non-Compliant - Violations Observed HYATTSVILLE MD 20782.0 3124 Queens Chapel RD NaN ... 0 0 0 0 1 0 0 0 0 2023-11-01
44592 1211 QUEENSWAY RESTAURANT Fast Food - Local 11/02/2023 12:00:00 AM Non-Compliant - Violations Observed RIVERDALE MD 20737.0 5851 RIVERDALE RD NaN ... 0 0 0 0 1 0 0 0 0 2023-11-01
44571 6154 AUNTIE ANNE'S PLANET SMOOTHIE Fast Food - Chain 11/01/2023 12:00:00 AM Critical Violations observed OXON HILL MD 20745.0 6800 OXON HILL RD 895 ... 0 0 0 1 0 0 0 0 0 2023-11-01
8645 8128 WINGSTOP Fast Food - Chain 05/17/2026 12:00:00 AM Critical Violations observed LAUREL MD 20707.0 14623 BALTIMORE AVE NaN ... 0 0 0 1 0 0 0 0 0 2026-05-01

5 rows × 49 columns

There shouldn't be any record with an inspection date later than 2023, so we should drop that row.
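Dropping by position works here because exactly one row is out of range. A hedged, more robust alternative is to filter on the documented collection window, which handles any number of stray rows; a toy sketch:

```python
import pandas as pd

# Toy Date column with one row past the November 2023 cutoff.
df_demo = pd.DataFrame({'Date': pd.to_datetime(['2023-11-01', '2026-05-01', '2019-03-01'])})

# Keep only rows inside the documented collection window.
df_demo = df_demo[df_demo['Date'] <= '2023-11-30']
```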

In [10]:
# Drop the row that does not comply to the data recorded time span
df_trans.drop(df_trans.index[-1], inplace=True)

# Check if the row was dropped successfully
df_trans.tail()
Out[10]:
Establishment_id Name Category Inspection_date Inspection_results City State Zip Address_line_1 Address_line_2 ... Diner Donut Fast Food Fast Food - Chain Fast Food - Local Pizza Restaurant School Seafood Date
44582 1467 SUBWAY SANDWICH SHOP Fast Food - Chain 11/02/2023 12:00:00 AM NaN GREENBELT MD 20770.0 6094 GREENBELT RD NaN ... 0 0 0 1 0 0 0 0 0 2023-11-01
44578 794 KENNYS SUB SHOP Carry-out 11/02/2023 12:00:00 AM NaN GREENBELT MD 20770.0 8823 GREENBELT RD NaN ... 0 0 0 0 0 0 0 0 0 2023-11-01
44591 8408 ATOMIC WINGS Fast Food - Local 11/02/2023 12:00:00 AM Non-Compliant - Violations Observed HYATTSVILLE MD 20782.0 3124 Queens Chapel RD NaN ... 0 0 0 0 1 0 0 0 0 2023-11-01
44592 1211 QUEENSWAY RESTAURANT Fast Food - Local 11/02/2023 12:00:00 AM Non-Compliant - Violations Observed RIVERDALE MD 20737.0 5851 RIVERDALE RD NaN ... 0 0 0 0 1 0 0 0 0 2023-11-01
44571 6154 AUNTIE ANNE'S PLANET SMOOTHIE Fast Food - Chain 11/01/2023 12:00:00 AM Critical Violations observed OXON HILL MD 20745.0 6800 OXON HILL RD 895 ... 0 0 0 1 0 0 0 0 0 2023-11-01

5 rows × 49 columns

Now that we have the correct date column, we may want to know the violations for each restaurant. It would be much easier to count the number of violations if we change each compliance result into 0 or 1, where 0 means in compliance and 1 means out of compliance.
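One caveat worth noting: calling `replace` on the whole DataFrame recodes these exact strings wherever they appear, in any column. Restricting the recode to the compliance columns, as in this toy sketch, keeps other string columns untouched:

```python
import pandas as pd

# Toy frame: one compliance column plus an unrelated column
# that happens to contain the exact same label.
df_demo = pd.DataFrame({'Proper_hand_washing': ['In Compliance', 'Out of Compliance'],
                        'Other_notes': ['In Compliance', 'Non-Compliant']})

# Recode only the compliance columns; other columns keep their strings.
cols = ['Proper_hand_washing']
df_demo[cols] = df_demo[cols].replace({'In Compliance': 0, 'Out of Compliance': 1})
```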

In [11]:
# Check that every column only has 'In Compliance' and 'Out of Compliance' as categorical values
df_trans.loc[:, 'Food_from_approved_source':'Food_contact_surfaces_and_equipment'].describe()
Out[11]:
Food_from_approved_source Food_protected_from_contamination Ill_workers_restricted Proper_hand_washing Cooling_time_and_temperature Cold_holding_temperature Hot_holding_temperature Cooking_time_and_temperature Reheating_time_and_temperature Hot_and_cold_running_water_provided Proper_sewage_disposal No_bare_hand_contact Adequate_hand_washing_facilities Rodent_and_insects Food_contact_surfaces_and_equipment
count 21385 21385 21385 21385 21385 21385 21385 21385 21385 21385 21385 21385 21385 21385 21385
unique 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
top In Compliance In Compliance In Compliance In Compliance In Compliance In Compliance In Compliance In Compliance In Compliance In Compliance In Compliance In Compliance In Compliance In Compliance In Compliance
freq 21342 20513 21371 20578 21160 17259 19532 21295 21315 21230 21340 21323 21041 17687 17068
In [12]:
# Switch 'In Compliance' to 0 and 'Out of Compliance' to 1
df_trans.replace({'In Compliance': 0, 'Out of Compliance': 1}, inplace=True)

# Check if the values were replaced successfully
df_trans.loc[:, 'Food_from_approved_source':'Food_contact_surfaces_and_equipment'].head()
Out[12]:
Food_from_approved_source Food_protected_from_contamination Ill_workers_restricted Proper_hand_washing Cooling_time_and_temperature Cold_holding_temperature Hot_holding_temperature Cooking_time_and_temperature Reheating_time_and_temperature Hot_and_cold_running_water_provided Proper_sewage_disposal No_bare_hand_contact Adequate_hand_washing_facilities Rodent_and_insects Food_contact_surfaces_and_equipment
29062 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
16878 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0
10110 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
27477 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
11170 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
In [13]:
# Create a new column to count the number of violations
df_trans['Number_of_violation'] = df_trans.loc[:, 'Food_from_approved_source':'Food_contact_surfaces_and_equipment'].sum(axis=1)

# Check if the column is created successfully
df_trans.head()
Out[13]:
Establishment_id Name Category Inspection_date Inspection_results City State Zip Address_line_1 Address_line_2 ... Donut Fast Food Fast Food - Chain Fast Food - Local Pizza Restaurant School Seafood Date Number_of_violation
29062 1088 PANDA EXPRESS 1793 Fast Food 07/21/2011 12:00:00 AM NaN BRANDYWINE MD 20613.0 15902 CRAIN HWY NaN ... 0 1 0 0 0 0 0 0 2011-07-01 0
16878 43 ANDREWS RESTAURANT Restaurant 07/13/2011 12:00:00 AM NaN SUITLAND MD 20746.0 6407 SUITLAND RD NaN ... 0 0 0 0 0 1 0 0 2011-07-01 2
10110 561 GAYLORD NATIONAL - OLD HICKORY STEAKHOUSE Restaurant 07/07/2011 12:00:00 AM NaN OXON HILL MD 20745.0 201 WATERFRONT ST NaN ... 0 0 0 0 0 1 0 0 2011-07-01 1
27477 1867 EDIBLE ARRANGEMENTS Catering Only 07/20/2011 12:00:00 AM NaN LAUREL MD 20707.0 13600 BALTIMORE AVE 204 ... 0 0 0 0 0 0 0 0 2011-07-01 0
11170 289 CHICK-FIL-A Fast Food 07/21/2011 12:00:00 AM NaN BRANDYWINE MD 20613.0 16003 CRAIN HWY NaN ... 0 1 0 0 0 0 0 0 2011-07-01 0

5 rows × 50 columns

In [14]:
# Create a new dataset in wide format that contains only the Establishment_id, Category, Inspection_date, and Number_of_violation columns
df_wide = df_trans[['Establishment_id', 'Category', 'Inspection_date', 'Number_of_violation']]
df_wide = df_wide.set_index(['Establishment_id', 'Category']).sort_index()
df_wide
Out[14]:
Inspection_date Number_of_violation
Establishment_id Category
2 Carry-out 07/19/2011 12:00:00 AM 2
Carry-out 11/20/2012 12:00:00 AM 1
Carry-out 12/03/2012 12:00:00 AM 0
Carry-out 04/15/2014 12:00:00 AM 0
Carry-out 10/22/2014 12:00:00 AM 1
... ... ... ...
16291 Restaurant 06/16/2023 12:00:00 AM 1
16309 Fast Food - Chain 12/02/2021 12:00:00 AM 1
Fast Food - Chain 06/28/2022 12:00:00 AM 0
Fast Food - Chain 08/11/2022 12:00:00 AM 0
16837 Fast Food - Chain 04/06/2023 12:00:00 AM 0

21385 rows × 2 columns

Now that we have successfully pivoted our columns, we can start our analysis.

Part 3: Analyzing Data¶

The Most Common Violation¶

First and foremost, we want to know which category of violations happens the most. This is important because we want to help PG County identify the top violations so that they can tackle the sanitation problems that are complained about most often.

In [15]:
# Count the frequency of violations in each category
violation = df_trans.loc[:, 'Food_from_approved_source':'Food_contact_surfaces_and_equipment'].sum().sort_values(ascending=False)
violation
Out[15]:
Food_contact_surfaces_and_equipment    4317
Cold_holding_temperature               4126
Rodent_and_insects                     3698
Hot_holding_temperature                1853
Food_protected_from_contamination       872
Proper_hand_washing                     807
Adequate_hand_washing_facilities        344
Cooling_time_and_temperature            225
Hot_and_cold_running_water_provided     155
Cooking_time_and_temperature             90
Reheating_time_and_temperature           70
No_bare_hand_contact                     62
Proper_sewage_disposal                   45
Food_from_approved_source                43
Ill_workers_restricted                   14
dtype: int64
In [16]:
# Import visualization libraries
import matplotlib.pyplot as plt 
import seaborn as sns
In [17]:
# Create a visualization of the frequency distribution
axis = sns.barplot(x=violation.values, y=violation.index, palette=['Blue'], orient='h')
axis.set_title('Frequency Distribution of Violations', fontdict={'fontsize': 20, 'fontweight': 700, 'color': 'maroon'}, pad=20)
plt.xlabel('Count of Violations')
plt.xticks(rotation='horizontal')

for p in axis.patches:
    width = p.get_width()
    axis.annotate(f'{width:.0f}',xy=(width, p.get_y() + p.get_height() / 2), ha='left', va='center')

From the chart, we can see that Food_contact_surfaces_and_equipment, Cold_holding_temperature, and Rodent_and_insects are the most common violations. It is clear that unhygienic food processing environments are prevalent among restaurants. This is concerning, since food spoils easily at high temperatures or on unhygienic surfaces where bacteria love to grow.

Therefore, we would suggest that PG County put effort into enforcing equipment temperature checks and kitchen sanitizing on a routine basis.

The Violation Rate Over Time¶

Now that we know the most common violations, let's see how the violation rate has changed. We want to know the proportion of restaurants with at least one violation per inspection, by month and year, and see whether there is any trend of inspections getting harder or easier over time.

In [18]:
# Create a new dataset of all the inspections that found at least one violation
df_trans2 = df_trans.loc[df_trans['Number_of_violation'] >= 1, ['Date', 'Number_of_violation']]

# Calculate the violation rate in each month by year
percentage = df_trans2.groupby('Date')['Number_of_violation'].count() / df_trans[['Date', 'Number_of_violation']].groupby('Date')['Number_of_violation'].count()
percentage
Out[18]:
Date
2011-07-01    0.512195
2011-08-01    0.389831
2011-09-01    0.295455
2011-10-01    0.304348
2011-11-01    0.366667
                ...   
2023-07-01    0.556818
2023-08-01    0.442105
2023-09-01    1.000000
2023-10-01    0.350877
2023-11-01    0.200000
Name: Number_of_violation, Length: 149, dtype: float64
In [19]:
# Transform the dataset to have separate year and month columns for data visualization
percentage = percentage.to_frame()
percentage.reset_index(inplace=True)
percentage['Year'] = percentage['Date'].dt.year
percentage['Month'] = percentage['Date'].dt.month
percentage
Out[19]:
Date Number_of_violation Year Month
0 2011-07-01 0.512195 2011 7
1 2011-08-01 0.389831 2011 8
2 2011-09-01 0.295455 2011 9
3 2011-10-01 0.304348 2011 10
4 2011-11-01 0.366667 2011 11
... ... ... ... ...
144 2023-07-01 0.556818 2023 7
145 2023-08-01 0.442105 2023 8
146 2023-09-01 1.000000 2023 9
147 2023-10-01 0.350877 2023 10
148 2023-11-01 0.200000 2023 11

149 rows × 4 columns

In [20]:
# Import interactive data visualizations library
import plotly.express as px

# Create visualization for violation rate over time
fig = px.line(percentage, x="Month", y="Number_of_violation", color='Year', title='Restaurant with At Least One Violation by Month')
fig.update_xaxes(tickmode='array', tickvals=list(range(1, 13)), 
                 ticktext=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])
fig.show()

*Hint: select the years needed for comparison by clicking entries in the legend.*

If we look at the years before COVID (for instance, 2017 to 2019), we can see that the percentage of restaurants with at least one violation decreases gradually year over year. It seems that inspections became easier to pass over the years.

However, if we add 2020 and 2021 into consideration, we notice that the violation proportion rises again. This may be because PG County took precautions at each restaurant to prevent the spread of COVID, and therefore performed stricter inspections of sanitation.

In addition, it seems that April tends to have fewer violations. This may be due to two possible reasons:

  1. Since April is still just the beginning of the spring season, food suppliers or students may still be in preparation or not yet back in classes. Therefore, food consumption is usually low during this period.
  2. Inspectors may tend to check during the semester, when food consumption volume is high, and therefore more events get reported.

To support this point, we should check the corresponding data for the same period. Let's take a look at the monthly inspection counts from 2017 through 2020.

In [21]:
# Check the number of inspections per month for each year from 2017 to 2020
for i in range(2017, 2021):
    df_year = df_trans[df_trans['Date'].dt.year == i]
    print(df_year.groupby('Date')['Establishment_id'].count().sort_values())
Date
2017-04-01    183
2017-12-01    193
2017-09-01    202
2017-01-01    209
2017-02-01    212
2017-11-01    219
2017-10-01    230
2017-03-01    240
2017-05-01    258
2017-07-01    259
2017-08-01    272
2017-06-01    279
Name: Establishment_id, dtype: int64
Date
2018-12-01    125
2018-04-01    127
2018-10-01    128
2018-09-01    144
2018-03-01    152
2018-02-01    154
2018-08-01    157
2018-11-01    168
2018-05-01    193
2018-06-01    193
2018-07-01    206
2018-01-01    243
Name: Establishment_id, dtype: int64
Date
2019-01-01    109
2019-02-01    127
2019-09-01    142
2019-04-01    153
2019-05-01    153
2019-10-01    162
2019-03-01    169
2019-07-01    184
2019-06-01    208
2019-08-01    252
2019-11-01    262
2019-12-01    280
Name: Establishment_id, dtype: int64
Date
2020-04-01      8
2020-11-01     22
2020-05-01     25
2020-10-01     41
2020-12-01     43
2020-09-01     48
2020-08-01     49
2020-06-01     67
2020-07-01    158
2020-03-01    194
2020-02-01    271
2020-01-01    303
Name: Establishment_id, dtype: int64

We can see that this matches our assumption: a low violation rate does not necessarily mean restaurants are performing better. It might be the result of a low number of inspections!

The Map for the Restaurants¶

Lastly, we want to know whether there are any particular areas in PG County with more violations. This is important to us as students: we should care about how the restaurants in our surrounding areas perform, as we interact with them the most on a daily basis.

Let's say we want to investigate the performance of all the restaurants in PG County this year (2023).

In [22]:
# Extract longitude and latitude information for the map visualization
df_trans['Longitude'] = df_trans['Location'].str.extract(r'(-?\d+\.\d+)').astype(float)
df_trans['Latitude'] = df_trans['Location'].str.extract(r' (\d+\.\d+)').astype(float)

# Assign a color category for the map visualization
df_trans['Mark'] = df_trans['Number_of_violation'].map(lambda x: 'More than 1 violation' if x >= 1 else 'No violation')
df_trans.head()
Out[22]:
Establishment_id Name Category Inspection_date Inspection_results City State Zip Address_line_1 Address_line_2 ... Fast Food - Local Pizza Restaurant School Seafood Date Number_of_violation Longitude Latitude Mark
29062 1088 PANDA EXPRESS 1793 Fast Food 07/21/2011 12:00:00 AM NaN BRANDYWINE MD 20613.0 15902 CRAIN HWY NaN ... 0 0 0 0 0 2011-07-01 0 -76.875918 38.673414 No violation
16878 43 ANDREWS RESTAURANT Restaurant 07/13/2011 12:00:00 AM NaN SUITLAND MD 20746.0 6407 SUITLAND RD NaN ... 0 0 1 0 0 2011-07-01 2 -76.895252 38.829488 More than 1 violation
10110 561 GAYLORD NATIONAL - OLD HICKORY STEAKHOUSE Restaurant 07/07/2011 12:00:00 AM NaN OXON HILL MD 20745.0 201 WATERFRONT ST NaN ... 0 0 1 0 0 2011-07-01 1 -77.016084 38.782417 More than 1 violation
27477 1867 EDIBLE ARRANGEMENTS Catering Only 07/20/2011 12:00:00 AM NaN LAUREL MD 20707.0 13600 BALTIMORE AVE 204 ... 0 0 0 0 0 2011-07-01 0 -76.868105 39.076438 No violation
11170 289 CHICK-FIL-A Fast Food 07/21/2011 12:00:00 AM NaN BRANDYWINE MD 20613.0 16003 CRAIN HWY NaN ... 0 0 0 0 0 2011-07-01 0 -76.876294 38.670276 No violation

5 rows × 53 columns
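As a hedged aside, the two extractions above can also be done with a single regex over the WKT-style `POINT (lon lat)` string, with the dots escaped so only literal decimal points match; a toy sketch:

```python
import pandas as pd

# Toy Location column in the dataset's WKT-style format.
loc = pd.Series(['POINT (-76.733979 38.945787)'])

# One pattern captures both coordinates into named columns.
coords = loc.str.extract(r'POINT \((?P<Longitude>-?\d+\.\d+) (?P<Latitude>-?\d+\.\d+)\)').astype(float)
```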

In [23]:
# Create a new dataset that only contains 2023 inspection records
df_2023 = df_trans[df_trans['Date'].dt.year == 2023]
In [24]:
# Create a map that shows all the restaurant in 2023
fig = px.scatter_mapbox(df_2023, lat="Latitude", lon="Longitude", hover_name="Name", hover_data=["Category", "Address_line_1"],
                        color='Mark', color_discrete_sequence=["blue", "red"], zoom=10, height=500)
fig.update_layout(title="2023 PG County Restaurant Performance", mapbox_style="carto-positron")
fig.update_layout(margin={"r":0,"t":0,"l":0,"b":0})
fig.show()

From the map, it looks like the farther we get from urban areas, the more visible the restaurants with no violations become. This seems reasonable, as downtown and near-city regions tend to have more customers every day; therefore, it can be easy to neglect sanitation maintenance there.

Also, if we zoom in a little and look at the College Park area, where the University of Maryland stands, we can see a fair number of restaurants that comply with the PG County food inspection. The most plausible reason is that restaurants near campus need to follow not only the PG County standard but also the university's standard.

Although we now roughly know the distribution of restaurants with violations, we want to dig even deeper. We would like to know exactly how severe the violations are and whether there is some clustering among them; for instance, which categories of restaurant tend to have more violations. Again, we will use the 2023 data as our investigation sample.

In [25]:
# Create a map that shows all the restaurants in 2023
# Warm tones mean a higher number of violations and cold tones mean a lower number
px.set_mapbox_access_token("pk.eyJ1Ijoic3llbjEyMzUiLCJhIjoiY2xveXltMzVuMDhoeTJpb3A0a2VlYjM0cyJ9.I4ikCf8sBEMgHP1heJfOKQ")

fig = px.scatter_mapbox(df_2023, lat="Latitude", lon="Longitude", hover_name="Name", hover_data=["Category"], color="Number_of_violation", size="Number_of_violation",
                  color_continuous_scale=px.colors.cyclical.IceFire, size_max=15, zoom=10)
fig.show()

If we hover over the points where the circles are red or orange, it seems that restaurant categories like fast food and carry-out tend to have more violations. Also, if we lasso-select the Landover area, it seems that fast food and carry-out restaurants cluster over that region. This makes some sense, since fast food and carry-out restaurants focus more on service efficiency and less on the overall quality of food presentation. Therefore, it is no surprise that they have a hard time maintaining high sanitary standards.

"I pledge on my honor that I have not given nor received any unauthorized assistance on this assignment."

--Sung-Jen, Yen